Abstract

Comprehensive inventories of plant viral diversity are essential for effective quarantine and sanitation efforts. The safety of 
regulated plant material exchanges presently relies heavily on techniques such as PCR or nucleic acid hybridisation, which 
are only suited to the detection and characterisation of specific, well characterised pathogens. Here, we demonstrate the 
utility of sequence-independent next generation sequencing (NGS) of both virus-derived small interfering RNAs (siRNAs) 
and virion-associated nucleic acids (VANA) for the detailed identification and characterisation of viruses infecting two 
quarantined sugarcane plants. Both plants originated from Egypt and were known to be infected with Sugarcane streak 
Egypt Virus (SSEV; Genus Mastrevirus, Family Geminiviridae), but were revealed by the NGS approaches to also be infected by 
a second highly divergent mastrevirus, here named Sugarcane white streak Virus (SWSV). This novel virus had escaped 
detection by all routine quarantine detection assays and was found to also be present in sugarcane plants originating from 
Sudan. Complete SWSV genomes were cloned and sequenced from six plants and all were found to share >91% genome- 
wide identity. With the exception of two SWSV variants, which potentially express unusually large RepA proteins, the SWSV 
isolates display genome characteristics very typical to those of all other previously described mastreviruses. An analysis of 
virus-derived siRNAs for SWSV and SSEV showed them to be strongly influenced by secondary structures within both 
genomic single stranded DNA and mRNA transcripts. In addition, the distribution of siRNA size frequencies indicates that 
these mastreviruses are likely subject to both transcriptional and post-transcriptional gene silencing. Our study stresses the 
potential advantages of NGS-based virus metagenomic screening in a plant quarantine setting and indicates that such 
techniques could dramatically reduce the numbers of non-intercepted virus pathogens passing through plant quarantine 
stations. 
Introduction

When attempting to prevent the spread of plant diseases, 
comprehensive inventories of viral diversity are fundamental both 
for effective quarantine and sanitation efforts, and to ensure that 
plant materials within biological resource centres (BRCs) can be 
safely distributed [1,2]. Detection of pathogens is one of the most 
critical quarantine and BRC operations. Ideally, the tools used for 
this purpose must be both sensitive enough to accurately detect the 
presence of even extremely low amounts of pathogen nucleic acids 
or proteins, and provide sufficient specific information to identify 
the genetic variants/strains of whatever pathogens are present.



The major challenge of using classical nucleic acid sequence- 
informed detection tools such as polymerase chain reaction (PCR) 
or Southern hybridisation assays, is that despite being highly 
sensitive, these techniques are generally either species or, at best, 
genus-specific. In addition, such tools lack the capacity to detect, 
let alone identify, pathogens that are unknown, poorly character- 
ized or highly variable. Although it might be argued that the most 
economically important pathogens tend to be well characterized 
and that it is therefore not a serious issue that many of the more 
obscure pathogens go undetected, it is becoming better appreci- 
ated that the "importance" of any particular pathogenic microbe 
is very difficult to define. Specifically, the environmental and 
economic impacts of a particular pathogen can vary widely with
varying climatic and ecological conditions and there are large 
numbers of microbes that are presently not classified as pathogens 
(or at least which are not noticeably pathogenic to humans or to 
domesticated plants and animals), which will eventually emerge as 
important future pathogens [3]. Also, since non-domesticated 
plant and animal species and countless numbers of microbes which 
contribute to natural terrestrial ecosystems [4-6] can also
potentially be threatened by exotic pathogens, the unconstrained
global dissemination of apparently harmless fungi, viruses and
bacteria could have serious environmental and economic impacts. 

Whereas "sequence-dependent" microbial detection methods, 
which are generally based on PCR or nucleic acid hybridisation,
can only be used to target known pathogens, sequence-indepen- 
dent next generation sequencing (NGS) based approaches can 
potentially provide an ideal platform for identifying almost all 
known and unknown microbes present in any particular host 
organism [5,7-9]. Coupled with innovative sample processing 
procedures, "metagenomics" applications of NGS [10] have 
already enabled the identification of novel pathogens through 
the rapid and comprehensive characterization of microbial strains 
and isolates within environmental and host tissue samples [9,11].

In addition to numerous applications in the study of animal 
infecting viruses, NGS-based metagenomics approaches have also 
been used to detect plant infecting viruses [12]. Three main classes 
of nucleic-acids have been targeted by such analyses: (1) virion- 
associated nucleic acids (VANA) purified from viral particles 
[13,14]; (2) double-stranded RNAs (dsRNA) [15]; and (3) virus- 
derived small interfering RNAs (siRNAs) [16]. Large numbers of 
both known and new plant and fungus infecting DNA and RNA 
viruses and viroids have been detected using these approaches 
[12,17-21]. 

A major shortcoming of these metagenomic approaches is, 
however, that they remain technically cumbersome and too 
expensive for routine diagnostic applications on collections of 
eukaryotic hosts - even if barcoded primers are used to bulk- 
sequence pooled samples from multiple sources [15]. Although 
prohibitive for high throughput diagnostics, the costs of NGS in 
the context of viral diversity research are often offset by the vast 
volumes of useful data that can be generated on viral population 
dynamics, co-infections, mutation frequencies and genetic recom- 
bination [22-24]. 

Here we describe the application of siRNA- and VANA- 
targeted NGS approaches to the analyses of two Egyptian 
sugarcane plants maintained for a number of years at the CIRAD 
Sugarcane Quarantine Station in Montpellier, France. These 
plants were both known to be infected with Sugarcane streak 
Egypt Virus (SSEV; Family Geminiviridae, Genus Mastrevirus)
and were maintained for use as positive controls during the 
application of diagnostic tools for SSEV detection in sugarcane 
plants passing through the quarantine station. Using the siRNA- 
and VANA-targeted NGS approaches, we discovered and 
characterized a novel highly divergent mastrevirus from these
two plants. This novel virus was also identified in other sugarcane 
plants originating from Sudan that exhibited white spots on the 
base of their leaf blades that become fused laterally, so as to appear 
as chlorotic stripes. Accordingly, we have proposed naming this 
virus Sugarcane white streak Virus (SWSV). In addition, we 
present a detailed analysis of siRNAs derived from the SWSV and 
SSEV variants infecting the two analysed sugarcane plants. 



Materials and Methods 

Plant material and sugarcane quarantine DNA collection

Leaves presenting typical symptoms of sugarcane streak disease
were sampled from two sugarcane plants that had previously been
found to be infected with Sugarcane streak Egypt virus (SSEV)
and had been kept in a quarantine greenhouse at the CIRAD 
Sugarcane Quarantine Station, in Montpellier, France. The two 
sugarcane plants, VARX and USDA (which was initially 
maintained at USDA-APHIS Plant Germplasm Quarantine 
before being transferred to CIRAD in 2007), were both initially 
collected in Egypt during two independent sampling surveys in the 
late 1990s [25,26]. These sampling surveys were both carried out 
on experimental stations and commercial lands in close collabo- 
ration with Egyptian authorities (Sugar Crop Research Institute 
(SCRI), Dr Abdel Wahab I. Allam (Director of SCRI) regarding 
VARX; and Agricultural Genetic Engineering Research Institute, 
Dr N.A. Abdallah, and Dr M.A. Madkour regarding USDA). In 
addition, leaves from six sugarcane plants originating from Sudan 
(B0065, B0067, B0069, D0002, D0003 and D0005, Table S1) and
maintained at the CIRAD Sugarcane Quarantine Station were 
also used (Material Transfer Agreements between CIRAD and 
Kenana Sugar Co. Ltd). DNAs from these six plants were 
extracted using the DNeasy Plant Mini Kit (Qiagen). In addition, 
DNA was extracted, using the DNeasy Plant Mini Kit, from an
additional 18 frozen leaf samples (-20°C; Table S1). These
included 17 samples originating from Sudanese sugarcane plants,
which had passed through the Montpellier Quarantine station
between 2000 and 2009, and one which had been obtained from a
sugarcane seedling grown from sugarcane true seeds (fuzz)
developed in Guadeloupe from a biparental cross involving plants
H70-6957 and B86-049.

VANA extraction from viral particles, cDNA amplification 
and sequencing 

One gram of leaf material from the VARX and USDA plants
was ground in Hanks' buffered salt solution (HBSS) (1:10) with
four ceramic beads (MP Biomedicals, USA) using a tissue
homogeniser (MP Biomedicals, USA). The homogenised plant
extracts were centrifuged at 3,200 ×g for 5 min and 6 ml of the
supernatants were further centrifuged at 8,228 ×g for 3 min. The
resulting supernatants were then filtered through a 0.45 µm sterile
syringe filter. The filtrate was then centrifuged at 148,000 ×g for
2.5 hrs at 4°C to concentrate viral particles. The resulting pellet
was resuspended overnight at 4°C in 200 µl of HBSS.
Non-encapsidated nucleic acids were eliminated by adding 15 U of
bovine pancreas DNase I (Euromedex) and 1.9 U of bovine 
pancreas RNase A (Euromedex, France) followed by incubation at 
37°C for 90 min. Total nucleic acids were finally extracted from 
virions using a NucleoSpin 96 Virus Core Kit (Macherey-Nagel, 
Germany) following the manufacturer's protocol. The amplifica- 
tion of extracted nucleic acids was performed as described by 
Victoria et al. [14] and aimed at detecting both RNA and DNA 
viruses. Reverse transcriptase priming and amplification of nucleic 
acids were used for detecting RNA viruses. A Klenow Fragment 
step was included in the protocol in order to detect DNA viruses as 
demonstrated by Froussard [27]. Briefly, viral cDNA synthesis was
performed by incubation of 10 µl of extracted viral nucleic acids
with 100 pmol of primer DoDec (5'-CCT TCG GAT CCT CCN
NNN NNN NNN NN-3') at 85°C for 2 min. The mixture was
immediately placed on ice. Subsequently, 10 mM dithiothreitol,
1 mM of each deoxynucleoside triphosphate (dNTP), 4 µl of 5×
Superscript buffer, and 5 U of Superscript III (Invitrogen, USA)
were added to the mixture (final volume of 20 µl), which was then
incubated at 25°C for 10 min, followed by 42°C incubation for
60 min and 70°C incubation for 5 min before being placed on ice 
for 2 min. cDNAs were purified using the QIAquick PCR cleanup
kit (Qiagen). Priming and extension was then performed using 
Large (Klenow) Fragment DNA polymerase (Promega). First, 
20 µl of cDNA in the presence of 4.8 µM of primer DoDec were
heated to 95°C for 2 min and then cooled to 4°C. 2.5 U of
Klenow Fragment, 10X Klenow reaction buffer and 0.4 mM of
each dNTP (final volume of 25 µl) were added. The mixture was
incubated at 37°C for 60 min followed by 75°C enzyme heat
inactivation for 10 min. PCR amplification was carried out using
5 µl of the reaction described above in a 20 µl reaction containing
2 µM primer (LinkerMid50 primer for VARX: 5'-ATC GTA
OCA GCC TTG GGA TGG TGG-3' and LinkerMid52 primer
for USDA: 5'-ATG TGT GTA GGG TTG GGA TGG TGG-3'),
and 10 µl of HotStarTaq Plus Master Mix Kit (Qiagen). The
following cycling conditions were used: one cycle of 95°C
for 5 min, five cycles of 95°C for 1 min, 50°C for 1 min, 72°C
for 1.5 min, 35 cycles of 95°C for 30 sec, 50°C for 30 sec, 72°C for
1.5 min +2 sec at each cycle. An additional final extension for
10 min at 72°C was then performed. DNA products were pooled
(VARX and USDA products and 94 additional products obtained
from other quarantine samples), cleaned up using the Wizard SV
Gel and PCR Clean-Up System (Promega) and sequenced on
1/8th of a 454 pyrosequencing plate using GS FLX Titanium
reagents (Beckman Coulter Cogenics, USA).

siRNA extraction and sequencing 

The nucleic acid extraction and sequencing approach of Kreuze 
et al. [16] was used with slight modifications. Total RNAs were
extracted from 100 mg of VARX fresh leaf material using Trizol
(Invitrogen) following the manufacturer's instructions. Small RNA 
libraries were directly generated from total RNAs. Small RNAs 
ligated with 3' and 5' adapters were reverse transcribed and PCR
amplified (30 sec at 98°C; [10 sec at 98°C, 30 sec at 60°C, 15 sec
at 72°C] ×13 cycles; 10 min at 72°C) to create cDNA libraries
selectively enriched in fragments having adapter molecules at both 
ends. The last step was an acrylamide gel purification of the 140- 
150 nt amplified cDNA constructs (corresponding to cDNA inserts 
from siRNAs +120 nt from the adapters). Small RNA libraries 
were checked for quality and quantified using a 2100 Bioanalyzer
(Agilent). The library was then sequenced on one lane of an
Illumina HiSeq as single-end 50 base reads.

Sequence assembly 

Analyses of reads produced by either Illumina (siRNA
sequencing) or 454 GS FLX Titanium (amplified-VANA
sequencing) were performed using CLC Genomics Workbench
5.15. De novo assemblies of contigs were performed with a
minimal contig size set at 100 bp and 200 bp for Illumina and 454
GS FLX Titanium reads, respectively. A posteriori mapping of
reads against the complete genomes of SWSV (once the full
genome had been cloned and sequenced) or SSEV or against parts
of these genomes was also performed using CLC Genomics
Workbench 5.15. Primary sequence outputs have been deposited 
in the sequence read archive of GenBank (accession numbers: 
VANA_USDA dataset: SRR1207274; VANA_VARX dataset: 
SRR1207275; siRNA_VARX dataset: SRR1207277). 

SWSV genome amplification, cloning and sequencing 

Two partially overlapping SWSV-specific PCR primer pairs
were designed so as to avoid any potential cross-hybridization to
63 representative species of the family Geminiviridae, including
SSEV. These two primer pairs (pair 1: SWSV_F1 forward primer
5'-GGT GAA AGG TAT GGG AAA GA-3' and SWSV_R1
reverse primer 5'-AGG GTG TGT AGA TGG TTT GG-3'; and
pair 2: ECORI-1F forward primer 5'-GAA TTG GGA GAG GGT
GGT A-3' and ECORI-2R reverse primer 5'-GAG TTG AAT
TGG GGT AGG AAG GAG-3') were complementary to
sequences within the rep gene of SWSV. Total DNAs from the
two sugarcane plants described above (VARX and USDA) were 
extracted using the DNeasy Plant Mini Kit (Qiagen) and screened 
for SWSV using the two pairs of primers and GoTaq Hot Start 
Master Mix (Promega) following the manufacturer's protocol. 
Amplification conditions consisted of an initial denaturation at 
95°C for 2 min, 35 cycles at 94°C for 10 sec, 55°C for 30 sec,
68°C for 3 min, and a final extension step at 68°C for 10 min.
Amplification products of ~2.8 Kbp were gel purified, ligated to 
pGEM-T (Promega) and sequenced by standard Sanger sequenc- 
ing using a primer walking approach. 

Reverse transcriptase priming and amplification of nucleic acids 
were carried out in order to detect the intron of the rep gene. Total 
RNAs from VARX were extracted using the RNeasy Plant Mini 
Kit (Qiagen). DNase treatment of extracted RNAs was carried out 
using RQ1 RNase-Free DNase (Promega) following the
manufacturer's protocol. Viral cDNA synthesis was performed by
incubation of 1 µl of DNase treated RNAs with 15 µl of RNase
free water, 0.6 µM of each primer (SWSV_F2: 5'-AGG ATG
TGG TGG GAG TAA TT-3' and ECORI-2R: 5'-GAG TTG
AAT TGG GGT AGG AAG GAG-3'), 0.4 mM of mixed
deoxynucleoside triphosphates (dNTPs), 5 µl of 5X Qiagen OneStep
RT-PCR Buffer and 1 µl of Qiagen OneStep RT-PCR Enzyme
Mix. Tubes were first placed at 50°C for 30 min for cDNA
synthesis. PCR amplification was then carried out using the
following cycling conditions: one cycle of 95°C for 15 min, 35
cycles of 94°C for 1 min, 55°C for 1 min, 72°C for 1 min. An
additional final extension for 10 min at 72°C was then performed.
Amplification products were gel purified, ligated to pGEM-T 
(Promega) and sequenced by standard Sanger sequencing. 

PCR detection tests 

DNAs from the 17 frozen (-20°C) samples of sugarcane plants
originating from Sudan and from the six plants maintained at
the CIRAD Sugarcane Quarantine Station were
screened for SWSV. DNA extracted from one sugarcane seedling
grown from true seeds (fuzz) was also screened for SWSV. PCR
amplification was carried out using the two pairs of primers
described above (SWSV_F1 and SWSV_R1; ECORI-1F and
ECORI-2R) using GoTaq Hot Start Master Mix (Promega)
following the manufacturer's protocol. Amplification products of
~2.8 Kbp were gel purified, ligated to pGEM-T (Promega) and
sequenced as described above. Plants infected with SWSV were
also screened for all known sugarcane-infecting mastreviruses:
Sugarcane streak Egypt Virus, Sugarcane streak virus, Maize streak
virus, Sugarcane streak Reunion virus, Eragrostis streak virus and
Saccharum streak virus. PCR amplification was carried out using
1 µl of DNA template in a 25 µl reaction containing 0.2 µM of
each broad spectrum primer (SSV_1732F: 5'-GAR TGV AGR
TTR TTY TGG GAG TA-3' and SSV_2176R: 5'-GAR TAG
GTY TGH ATG MTH GAG A-3') and GoTaq Hot Start Master
Mix (Promega) following the manufacturer's protocol. The
following cycling conditions were used: one cycle of 95°C for
2 min, 35 cycles of 94°C for 1 min, 53°C for 1 min, 72°C for
1 min. An additional final extension for 10 min at 72°C was then
performed.
Sequence analyses

Six complete genomes of the novel mastrevirus were recovered 
from plants VARX, USDA, A0037, B0069, D0005 and E0144 
(Table S1) and were aligned with the genomes of representative
mastreviruses using MUSCLE (with default settings) [28]. 
Similarly, the predicted replication associated protein (Rep) and 
capsid protein (CP) amino acid sequences encoded by the viruses 
within the full-genome dataset were also aligned using MUSCLE.
Maximum likelihood phylogenetic trees were inferred for the full 
genomes (TN93+G+I nucleotide substitution model chosen as the 
best-fit using jModelTest [29]), Rep (WAG+G+F amino acid 
substitution model chosen as the best-fit using ProtTest [30]) and 
CP (rtREV+G+F amino acid substitution model chosen as the
best-fit using ProtTest) datasets with PHYML [31]. Approximate 
likelihood ratio tests (aLRT) were used to infer relative supports for 
branches (with branches having <80% support being collapsed). 
All pairwise identity analyses of the full genome nucleotide
sequences, capsid protein (CP) amino acid sequences, replication 
associated protein (Rep) amino acid sequences and movement 
protein (MP) amino acid sequences were carried out using the 
MUSCLE-based pairwise alignment and identity calculation 
approach implemented in SDT v1.0 [32]. The full genome
sequence alignment of representative mastrevirus genome se- 
quences together with SWSV was used to detect evidence of 
recombination in SWSV using RDP 4.24 with default settings 
[33]. Sequences are deposited in GenBank under accession 
numbers (SWSV-A [SD-VARX-2013] - KJ187746; SWSV-A
[SD-USDA-2013] - KJ187745; SWSV-B [SD-B0069-2013] -
KJ210622; SWSV-B [SD-D0005-2013] - KJ187747; SWSV-B
[SD-E0144-2013] - KJ187748 and SWSV-C [SD-A0037-2013] -
KJ187749). 
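
The pairwise identity scores used for species demarcation can be illustrated with a short sketch. It scores two pre-aligned sequences as the fraction of matching positions among alignment columns where neither sequence has a gap. This is a minimal sketch under that assumed convention and is not the SDT implementation; the function names and the toy sequences are hypothetical, and the 0.78 default is the species demarcation threshold applied in this study [32].

```python
def pairwise_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Fraction of identical residues over columns where neither sequence is gapped."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must come from the same alignment")
    compared = matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        matches += a == b
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return matches / compared


def same_species(identity: float, threshold: float = 0.78) -> bool:
    """Assumed rule: genomes at or above the threshold belong to one species."""
    return identity >= threshold


# Toy alignment (hypothetical sequences, not real viral data).
toy_identity = pairwise_identity("ATGC-ATGCA", "ATGCTATGAA")
print(f"{toy_identity:.3f}", same_species(toy_identity))

# A genome-wide identity of 61% to the closest known relative falls
# below the threshold, which is the situation reported for SWSV.
print(same_species(0.61))
```

Excluding gapped columns keeps insertions and deletions from being counted as mismatches, which matters when comparing genomes of slightly different lengths.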

Test for associations between siRNAs and SWSV/SSEV 
genomic and transcript secondary structures 

The SWSV/SSEV full genome sequences and predicted
unspliced complementary and virion strand transcripts were
separately folded using Nucleic Acid Secondary Structure
Predictor (NASP) [34], with the sequence conformation set as
circular DNA, at a temperature of 25°C. NASP generates a list of all
secondary structures detectable within given DNA or RNA 
sequences and through simulations it demarcates a set of structures 
referred to as a "high confidence structure set" (HCSS), that 
confers a higher degree of thermodynamic stability (lower free 
energy) to the sequences than what would be expected to be 
achievable by randomly generated sequences with the same base 
composition (with p ≤ 0.05).
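
The logic of comparing a sequence against randomised sequences with the same base composition can be sketched as a simple permutation test. The example below is only an illustration of that idea and is not NASP: in place of a thermodynamic free-energy calculation it uses a hypothetical toy score (the length of the longest perfect hairpin stem), and the sequence, shuffle count and seed are arbitrary assumptions.

```python
import random

PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def toy_stem_score(seq: str, min_loop: int = 3) -> int:
    """Length of the longest perfect hairpin stem (toy stand-in for folding energy)."""
    n, best = len(seq), 0
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            k = 0
            # extend the stem outwards from the loop while bases can pair
            while i - k >= 0 and j + k < n and (seq[i - k], seq[j + k]) in PAIRS:
                k += 1
            best = max(best, k)
    return best


def empirical_p(seq: str, n_shuffles: int = 200, seed: int = 1) -> float:
    """Share of composition-preserving shuffles scoring at least as high as seq."""
    rng = random.Random(seed)
    observed = toy_stem_score(seq)
    bases = list(seq)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(bases)  # same base composition, random order
        if toy_stem_score("".join(bases)) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)


hairpin = "GGGGGGAAAACCCCCC"  # hypothetical sequence with a 6 bp stem
print(toy_stem_score(hairpin), empirical_p(hairpin) <= 0.05)
```

A structure would enter a high confidence set in this scheme only when shuffled sequences rarely match its score, mirroring the p ≤ 0.05 criterion described above.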

Given the genomic coordinates of pairing nucleotides within the 
HCSS, we investigated whether there was any significant trend for 
more reads (looking both at all reads collectively and at the 21 nt,
22 nt, 23 nt and 24 nt long reads separately) occurring within
secondary structures predicted to occur within (i) the full genomes,
(ii) the virion-strand transcripts and (iii) the complementary-strand
transcripts. The reads were mapped to the secondary structures 
and we counted how many nucleotides were located at paired and 
unpaired sites. While Kolmogorov-Smirnov tests (implemented in
R; www.r-project.org) were used to determine whether the
distributions of reads between paired and unpaired sites were
different, Wilcoxon rank-sum tests (also implemented in R)
indicated whether there were significantly more reads
at paired sites compared to unpaired sites and vice versa. Whereas
the Kolmogorov-Smirnov tests were used to indicate whether any
associations existed between siRNA locations and base pairing
within nucleic acid secondary structures, the Wilcoxon rank-sum
tests were used to determine whether detected associations were
positive (siRNAs tended to occur at structured sites) or negative
(siRNAs tended to occur outside of structured sites).
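
The two-step comparison described above can be sketched in a few lines. The tests in this study were run in R; the Python code below is only a minimal illustration that computes the two test statistics (not their p-values) from per-site read depths. The function names, the depth values and the set of paired sites are all hypothetical.

```python
from bisect import bisect_left, bisect_right


def split_by_pairing(depth, paired_sites):
    """Split per-site read depth into (paired, unpaired) lists."""
    paired = [d for i, d in enumerate(depth) if i in paired_sites]
    unpaired = [d for i, d in enumerate(depth) if i not in paired_sites]
    return paired, unpaired


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )


def rank_sum_u(a, b):
    """Mann-Whitney U of sample a (rank-sum statistic), with midranks for ties."""
    pooled = sorted(list(a) + list(b))

    def midrank(v):
        # average of the 1-based positions that value v occupies in the pool
        return (bisect_left(pooled, v) + 1 + bisect_right(pooled, v)) / 2

    return sum(midrank(v) for v in a) - len(a) * (len(a) + 1) / 2


# Hypothetical depths at eight genome sites; sites 0-3 are base paired.
depth = [0, 1, 1, 2, 3, 4, 5, 5]
paired, unpaired = split_by_pairing(depth, paired_sites={0, 1, 2, 3})
d = ks_statistic(paired, unpaired)
u_paired = rank_sum_u(paired, unpaired)
# U near 0 means paired sites tend to have lower depth than unpaired sites;
# U near len(paired) * len(unpaired) means the opposite.
print(d, u_paired, len(paired) * len(unpaired))
```

The D statistic only says that the two depth distributions differ, whereas the direction of the rank-sum statistic indicates whether reads favour paired or unpaired sites, which is why the two tests are used together.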

Results 

454-based sequencing of VANA from the VARX and 
USDA sugarcane samples 

This approach was used in an attempt to detect both RNA and 
DNA viruses that may be present in the two sugarcane plants [27]. 
A total of 2612 and 1635 reads were respectively obtained from 
the VARX and USDA plants following length and quality 
filtering. One hundred and eight and 18 contigs were produced 
by de novo assembly from the VARX- and USDA-derived reads, 
respectively. Two contigs from the VARX plant (2706 nt and 
412 nt) and two from the USDA plant (2706 nt and 649 nt), 
encoded proteins with between 91 and 100% sequence identity 
with previously described SSEV proteins (Table 1). BLASTx
analysis revealed that an additional two contigs from the VARX
plant (2122 and 127 nt) and three contigs from the USDA plant
(1836, 196 and 312 nt) were homologous with known
mastreviruses but were nevertheless only distantly related to mastrevirus
sequences currently deposited in GenBank (Table 1).

A posteriori mapping of VANA 454 reads obtained from the 
VARX and USDA plants against the complete SWSV genome 
(see below), revealed that 23.9% (625/2612) and 16.1% (264/
1635) of the total reads were derived from this genome and that
these yielded complete genome coverage at an average depth of
81X and 29X, respectively. Interestingly, a ~120 nt long region of
very low coverage (<4X) was identified, which mapped to the 
large intergenic region (LIR) of the SWSV genome (Figure 1). 

A mapping analysis performed with the genome of SSEV 
indicated that the corresponding values were 53.5% of reads 
(1398/2612, 159X average coverage depth) and 75% of reads 
(1227/1635, 138X coverage) for the VARX and USDA plants, 
respectively (Figure S1).
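
The read fractions quoted above follow directly from the mapped and total read counts, as the short sketch below shows. The counts are those reported in the text; the helper names are hypothetical, and the mean-depth example uses made-up read lengths and an assumed 2,800 nt genome purely to show the arithmetic.

```python
def percent_mapped(mapped_reads: int, total_reads: int) -> float:
    """Percentage of filtered reads that map to a reference genome."""
    return 100.0 * mapped_reads / total_reads


def mean_depth(read_lengths, genome_length: int) -> float:
    """Average coverage depth: total mapped bases divided by genome length."""
    return sum(read_lengths) / genome_length


# Read counts reported above: (mapped, total) per plant and reference genome.
counts = {
    ("VARX", "SWSV"): (625, 2612),
    ("USDA", "SWSV"): (264, 1635),
    ("VARX", "SSEV"): (1398, 2612),
    ("USDA", "SSEV"): (1227, 1635),
}
for (plant, virus), (mapped, total) in counts.items():
    print(f"{plant} {virus}: {percent_mapped(mapped, total):.1f}% of reads")

# Hypothetical example: 10 reads of 300 nt on an assumed 2,800 nt genome.
print(f"{mean_depth([300] * 10, 2800):.2f}X")
```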

siRNA Illumina sequencing from the VARX sugarcane
plant 

A total of 15,275,640 raw reads were generated from the 
VARX sugarcane sample, which were then filtered down to 
3,945,108 high quality reads in the 21 to 24 nt size range of
siRNAs. From these reads, 226 contigs were obtained by de novo
assembly, six of which showed significant degrees of similarity to
mastreviruses based on BLASTx [35] searches (Table 2). Of these
six contigs, two (contigs #121 and #176) had a high degree of
identity to SSEV while the remaining four were more distantly
related to known mastreviruses. Three of these four contigs
(contigs #44, #86 and #101) apparently corresponded with a
mastrevirus capsid protein (CP) gene and the other one (contig
#79) with a movement protein (MP) gene, while the cumulative
contig length of 761 bp corresponded to slightly more than a
quarter of a typical mastrevirus genome (Table 2).

Following the cloning and sequencing of the full genome of the
new mastrevirus (SWSV; see below) it was determined that 0.59%
of the Illumina reads obtained from the VARX plant could be
mapped to this genome (Figure 1) to generate contigs that covered
96.3% of the genome at an average depth of 185X with only seven
gaps of between three and 40 nucleotides. These gaps were located
within the large intergenic region (three gaps) and within the
probable replication associated protein (Rep) gene (four gaps;
Figure 1) encoded by the C1 ORF. It is noteworthy that the
~120 nt long region of very low coverage (<4X) identified using
the VANA approach mapped to the same part of the LIR
of the SWSV genome that remained uncovered during the
Illumina-based siRNA sequencing (Figure 1). As has been
previously observed for other viruses, genome coverage was highly
heterogeneous (Figure 1). However, a clear general trend could be
observed, with the region corresponding to the virion sense V1
and V2 ORFs (encoding CP and MP proteins, respectively),
showing an average coverage depth of ~436X and the
complementary sense C1 ORF showing an average coverage of only
~38X. Coverage of the non-coding large and small intergenic
regions and the presumed C1 ORF intron was even lower at
17.5X and 6.8X, respectively.

It is also noteworthy that besides differences in coverage depth, 
these various genomic regions of SWSV also showed differences in 
the siRNA size classes that they yielded. While there was an 
enrichment of the 21 and 22 nt siRNA size classes amongst the
total siRNA reads mapping to the V1 and V2 ORFs, there was a
depletion of the 21 nt siRNA size class and an enrichment of the
24 nt size class amongst total siRNA reads mapping to the C1
ORF (Figure 2). The LIR and, to a lesser extent, the SIR showed a
pattern similar to the C1 ORF region (data not shown). The C1
intron, however, had an extreme over-representation of the 24 nt 
size class with the other size classes being either nearly (22 nt) or 
totally (21 and 23 nt) absent (Figure 2). 
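
One way to make "enrichment" and "depletion" of a size class concrete is to compare the share of each size class within a genomic region with its share among all virus-derived reads. The sketch below assumes that definition, which is an illustrative choice and not necessarily the calculation behind Figure 2; the function names and read-length lists are hypothetical.

```python
from collections import Counter

SIZE_CLASSES = (21, 22, 23, 24)


def size_class_fractions(read_lengths):
    """Fraction of reads in each siRNA size class (21 to 24 nt)."""
    counts = Counter(n for n in read_lengths if n in SIZE_CLASSES)
    total = sum(counts.values())
    if total == 0:
        return {size: 0.0 for size in SIZE_CLASSES}
    return {size: counts[size] / total for size in SIZE_CLASSES}


def enrichment(region_lengths, all_lengths):
    """Ratio of each size-class fraction in a region to its overall fraction."""
    region = size_class_fractions(region_lengths)
    overall = size_class_fractions(all_lengths)
    return {
        size: (region[size] / overall[size]) if overall[size] else float("nan")
        for size in SIZE_CLASSES
    }


# Hypothetical read lengths for two regions (not the study's data).
virion_sense = [21] * 50 + [22] * 30 + [23] * 5 + [24] * 15
complementary_sense = [21] * 10 + [22] * 20 + [23] * 10 + [24] * 60
ratios = enrichment(complementary_sense, virion_sense + complementary_sense)
print({size: round(r, 2) for size, r in ratios.items()})
```

Under this definition a ratio above 1 marks a size class as enriched in the region and a ratio below 1 marks it as depleted.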

Since the VARX plant was also infected with SSEV, a similar 
analysis of SSEV-derived siRNAs was performed. Mapping 
against the genome of SSEV (NC_001868) demonstrated that 
0.17% of total reads (6572) were derived from it and that these 
reads covered 98.6% of the SSEV genome at an average depth of 
55X, leaving only 4 gaps of between 5 and 15 nucleotides (Figure 
S1). Although showing some high degrees of local heterogeneity,
genome coverage of SSEV was less biased when comparing the
different genomic regions. Nevertheless a similar trend to that
associated with SWSV was observed with a higher depth of
coverage for the virion sense V1-V2 ORFs (76.5X) than for both
the complementary sense C1 ORF (37.6X) and the non-coding
regions (46X). Also, as for SWSV, the 21-22 nt siRNA size classes
were enriched amongst those mapping to the virion sense ORFs
and the 24 nt siRNA size class was enriched amongst those
mapping to the complementary sense C1 ORF (Figure S1).
However, unlike for SWSV, no strong siRNA size-class biases 
were observed for the non-coding regions (data not shown). 

By collectively using the Illumina siRNA reads and the 454
VANA reads it was possible to assemble a single genome of the 
novel mastrevirus from both the VARX and USDA plants. 


Associations between siRNAs and SWSV/SSEV genomic 
and transcript secondary structures 

It has been previously determined that nucleic acid structures 
can have an appreciable impact on both the distribution of siRNA 
targets [36,37], and the operational efficiency of small RNA 
mediated anti-viral and anti-viroid defences [37,38]. We detected 
strong evidence for the presence of ssDNA secondary structures in 
both the SWSV (30 high confidence structure set (HCSS)
structures identified) and SSEV (29 HCSS structures identified) genomes
(Table 3). The distributions of the HCSS structural elements were,
however, different in the predicted virion and complementary
strand transcripts of the two viruses, with only two HCSS
structures detected in the SWSV complementary strand transcript 
and none being detected in the SSEV virion strand transcript (so 
that this particular transcript was not analysed further). 

We detected a strong association between the absence of 
predicted secondary structures within the ssDNA SWSV genome 
and increased frequencies of corresponding 22, 23 and 24 nt long 
siRNAs (p-values <0.008; Table 3). Curiously, we found a
different association when considering the predicted SWSV RNA
transcripts with 21 nt siRNA reads displaying a strong tendency to
correspond with nucleotide sites that were predicted to be base
paired in both the virion and complementary strand transcripts
(p-values <2.49 × 10^-6) and the 22, 23 and 24 nt siRNA size classes
displaying a similar tendency with respect to the virion strand
transcript (p-values <6.07 × 10^-13).

Figure 1. SWSV genome coverage following NGS. The genomic organization of SWSV is schematically shown above the graph. The relative
degrees of coverage achieved after a posteriori mapping of reads produced by Illumina-based siRNA sequencing against the SWSV genome are
indicated in green, and the coverage achieved after mapping reads produced by 454 GS FLX Titanium-based VANA sequencing is indicated in blue.
doi:10.1371/journal.pone.0102945.g001

Similar to SWSV, for the SSEV full genome there was an
association between the absence of ssDNA structural elements and 
increased frequencies of 22 nt siRNAs. Also similar to SWSV the 
22, 23 and 24 nt long siRNAs display a significant tendency to 
correspond with transcript nucleotides that are base-paired within 
secondary structural elements. 

A novel sugarcane-infecting mastrevirus originating from 
the Nile region 

The complete genome of SWSV, as recovered from the VARX 
and USDA plants, is most similar to that of Wheat dwarf India 
virus (WDIV, accession number NC_017828), with which it
shares 61% genome-wide identity. Whereas the Rep and MP 
amino acid sequences of SWSV are also most similar to those of 
WDIV (54.4% and 44.8% identity, respectively), the CP is most 
similar to that of Panicum streak virus (PanSV, NC_001647, 
51.4-53.9%). Based on the 78% species demarcation threshold set 
by the Geminivirus study group of the ICTV [32], it is clear that 
the novel mastrevirus should be considered a new species within 
the genus Mastrevirus of the Family Geminiviridae (Figure S2). 
This is further confirmed by phylogenetic analyses performed on 
both the full genome (Figure 3) and on the amino acid sequences
of its encoded proteins (Figure 4). The new virus clearly clusters 
with mastreviruses on a branch that is not closely associated with 
any other species classified within this genus. Whereas the CP of 
SWSV clusters within the virus clade including the various African 
streak viruses, Australasian striate mosaic viruses, Digitaria streak 
virus (DSV) and WDIV, the Reps cluster with the African streak 
viruses and WDIV (Figure 4). 

SWSV was not detected in sugarcane seedlings derived from 
sugarcane true seeds under sterile insect-proof conditions, in 
agreement with the fact that seed transmission of geminiviruses has 
not so far been reported. The novel mastrevirus was, however, 
detected in five sugarcane plants originating from Sudan (A0037, 
B0065, B0069, D0005 and E0144) out of the 23 screened (Table 
S1). 

Complete SWSV genomes from four sugarcane plants (A0037, 
B0069, D0005 and E0144) were cloned and sequenced. The 
genomes of these isolates have >91% genome-wide identity with 
those recovered from the VARX and USDA sugarcane plants. 
Phylogenetic analyses of the full genomes (Figure 3) and of the 
amino acid sequences that they likely encode (Figure 4) confirmed 
that the isolates obtained from the Sudanese sugarcane plants also 
belong to the SWSV species. The six isolates can be further 
classified into 3 strains, SWSV-A (VARX, USDA), -B (B0069, 
D0005, E0144) and -C (A0037) (Figure S3) based on the proposed 
classification of mastrevirus strains outlined by Muhire et al. [32]. 
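
The demarcation logic applied above can be sketched in a few lines of Python. This is an illustrative sketch only, not the classification tool used in this study: the function name `classify_pair` is hypothetical, the 78% species and 94% strain thresholds are the values cited in this paper [32], and the handling of identities that fall exactly on a threshold is an assumption.

```python
SPECIES_THRESHOLD = 78.0  # % genome-wide identity, species demarcation [32]
STRAIN_THRESHOLD = 94.0   # % genome-wide identity, strain demarcation [32]

def classify_pair(identity_pct):
    """Classify two mastrevirus genomes from their genome-wide pairwise
    nucleotide identity (percent). Boundary handling is an assumption."""
    if identity_pct < SPECIES_THRESHOLD:
        return "different species"
    if identity_pct < STRAIN_THRESHOLD:
        return "same species, different strains"
    return "same species, same strain"

# 61% is the SWSV vs WDIV identity reported in the text.
print(classify_pair(61.0))
# 91.5% is a hypothetical value for two SWSV isolates from different strains.
print(classify_pair(91.5))
```

With these inputs, the SWSV/WDIV pair falls below the species threshold, while the hypothetical pair of SWSV isolates falls between the two thresholds.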

Using primers that allow the amplification of all sugarcane-
infecting mastreviruses, including Sugarcane streak Egypt Virus, 
Sugarcane streak virus, Maize streak virus, Sugarcane streak 
Reunion virus, Eragrostis streak virus and Saccharum streak virus, 
the five sugarcane plants originating from Sudan were shown to be 
free of co-infection with other known mastreviruses. Three of them 
are still maintained at the CIRAD sugarcane quarantine station 
(B0065, B0069 and D0005) and exhibit white spots on the base of 
their leaf blades, around the blade joint where the two wedge 
shaped areas called "dewlaps" are located (Figure S4). These spots 
can become fused laterally, so as to appear as chlorotic stripes 
(Figure S4). It is noteworthy that the SWSV infected D0005 plant 
displayed very little evidence of these spots (only one leaf out of 
eight exhibited tiny white spots that resembled thrips damage) and 
it is therefore very likely that SWSV infections could escape visual 
inspection (Figure S4). Given that three of the infected sugarcane 
varieties exhibited mild foliar symptoms, i.e. white spots on the 
base of their leaf blades that become fused laterally, so as to appear 
as chlorotic stripes, we propose naming the new species Sugarcane 
white streak Virus. 

Genome analysis of SWSV 

The SWSV genomes recovered from the various sugarcane 
plants were between 2828 and 2836 nt long and are, in almost all 
respects, very similar to those of all other previously described 
mastreviruses. The one exceptional feature of the SWSV genomes 
is that, in the case of the VARX and USDA isolates, alternative splicing 
of complementary sense transcripts likely results in the expression 
of both a standard Rep (which is predicted to be 396 amino acids 
long) and a rather unusual RepA that is 418 amino acids long. This is 
the only known occurrence in any geminivirus of a RepA that is 
larger than Rep. 

Given the uniqueness of this apparent genome organisation in 
the USDA and VARX isolates the correct identification of the 
intron within the complementary sense transcript was verified. 
RT-PCR reactions targeting the complementary sense transcript 
clearly indicated the presence of a mixture of spliced and non- 
spliced complementary sense mRNA transcripts, and confirmed 
that the correct locations of the acceptor and donor sites of the 
66 nt long SWSV intron had been identified (Figure S5). 
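
The size difference between the two predicted proteins is consistent with the reported intron length. The short check below is only an arithmetic sketch. It assumes that RepA is translated from the unspliced transcript and that the 66 nt intron is read through in frame in the VARX and USDA isolates; the constant names are illustrative.

```python
INTRON_LENGTH_NT = 66   # intron length reported in the text
REP_LENGTH_AA = 396     # predicted Rep (spliced transcript)
REPA_LENGTH_AA = 418    # predicted RepA in the VARX and USDA isolates

# An intron can only be read through in frame if its length is a
# whole number of codons (a multiple of 3 nt).
codons, remainder = divmod(INTRON_LENGTH_NT, 3)
extra_residues = REPA_LENGTH_AA - REP_LENGTH_AA

print(remainder == 0)          # True
print(codons, extra_residues)  # 22 22
```

Under these assumptions, the 22 additional residues of RepA correspond exactly to the 22 codons of the retained intron.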

Analysis of recombination 

All the SWSV genome sequences determined here share 
evidence of the same ancestral recombination event in the short 
intergenic region, corresponding to genomic coordinates 1419-
1468 in the USDA isolate (p = 3.821 × 10⁻⁷ for the GENECONV, 
MAXCHI and RDP methods implemented in RDP4.24). 
Corresponding coordinates are known to be very common sites 
of recombination in mastreviruses [39] and the fragment that they 
delimit in SWSV has apparently been derived from something 
resembling an African streak virus. 

Discussion 

We have performed NGS-based analyses of both siRNA and 
VANA isolated from sugarcane plants originating from Egypt. 
Both sequence-independent NGS approaches revealed the presence 
of a novel mastrevirus, SWSV, which had so far escaped 
routine quarantine detection assays, possibly because it was 
present in mixed infection with SSEV. The procedures used for 
the discovery of SWSV pave the way towards the application of 
NGS-based quarantine detection procedures. Such procedures 
would likely be hierarchical with a first stage sequence-indepen- 
dent NGS step followed by sequence-dependent secondary assays. 
Whereas the first step would be to identify novel viruses within a 
single plant (perhaps one displaying apparent disease symptoms), 
the second step would be to use sequence dependent approaches to 
both confirm the presence of any novel virus(es) identified in the 
original host, and identify the presence of this(ese) virus(es) in 
larger plant collections. A major strength of such an approach is 
that it would also yield complete genome sequences. 

The present study also confirms that both VANA [13] and 
siRNA [16] can be successfully targeted by metagenomics 
approaches for the discovery and characterization of plant-
infecting DNA viruses. The VANA-based 454 pyrosequencing 
approach has several advantages as it initially combines reverse- 
transcriptase priming and a Klenow Fragment step, which 
potentially enables the detection of both RNA and DNA viruses. 
Additionally, up to 96 tagged amplified DNAs (cDNA and DNAs 
amplified using the Klenow Fragment step) can be pooled and 
sequenced in multiplex format [15] making this approach very 
useful for routine diagnostic screening of plants within BRCs and 
quarantine stations. However, validation using plants infected or 
co-infected with RNA and DNA viruses needs to be carried out in 
order to determine the sensitivity and specificity levels of this 454- 
based VANA sequencing approach. 

Figure 2. Size distribution of sequenced siRNAs obtained from the VARX plant. The histograms represent the numbers of siRNA reads in 
each size class. (A) The size distributions of total reads, (B) The size distributions of reads mapping to the rep gene C-sense intronic region of SWSV, (C) 
The size distributions of reads mapping to the V1-V2 ORFs region of SWSV and (D) The size distributions of reads mapping to the C1 ORF region of 
SWSV. 

Virus-derived siRNAs naturally accumulate in virus-infected 
plants as a consequence of the action of Dicer enzymes as part of 
the RNA silencing-based plant antiviral defences [40]. Adopting a 
metagenomic approach and randomly sequencing these siRNAs is 
therefore an extremely powerful way to discover and characterise 
previously unknown plant viruses and viroids [16,41]. In addition 
to providing evidence for the presence of the two mastreviruses co- 
infecting the VARX sugarcane plant, this approach provided 
information on the interaction of the plant antiviral silencing 
machinery and these two viruses. Although these aspects have 
been studied previously in geminiviruses in the Begomovirus genus 
(Blevins et al., 2006; Akbergenov et al., 2006; Rodriguez-Negrete 
et al., 2009; Yang et al., 2011; Aregger et al., 2012), very little 
comparable information has previously been available for 
mastreviruses. 

The siRNA distributions observed here for SWSV and SSEV, 
perhaps unsurprisingly, seem to largely parallel those previously 
reported for begomoviruses. In particular, the differences in size 
classes observed between different genome regions suggest that 
mastreviruses are subject both to transcriptional gene silencing, 
based on 24 nt long siRNAs produced through the action of 
DCL3, and to post-transcriptional gene silencing (PTGS) medi- 
ated by the 21-22 nt long siRNAs produced through the action of 
the antiviral Dicers DCL4 and DCL2 (Rodriguez-Negrete et al., 
2009; Aregger et al., 2012). The action of the former mechanism is 
particularly evident in the siRNAs mapping to the SWSV intron 
but is also, to a lesser extent, evident in the siRNAs mapping to 
both the non-coding regions and the complementary sense ORFs 
of SWSV and SSEV. On the other hand, the 21-22 nt siRNA size 
classes associated with PTGS are particularly evident in the 
siRNAs mapping to the two virion sense ORFs which are known 
to be more actively transcribed in mastreviruses than their 
complementary sense counterparts [42]. 
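
The size-class reasoning in this paragraph can be illustrated with a short tally. The sketch below is not the analysis pipeline used in this study; the function name `size_class_profile` and the example read lengths are hypothetical. It only counts read lengths for one genome region and reports the share falling in the 21-22 nt classes (associated above with PTGS through DCL4 and DCL2) and in the 24 nt class (associated with transcriptional gene silencing through DCL3).

```python
from collections import Counter

# Size classes and the silencing pathways the text associates with them.
PTGS_SIZES = {21, 22}   # DCL4 / DCL2 products
TGS_SIZES = {24}        # DCL3 products

def size_class_profile(read_lengths):
    """Return the fraction of reads in the PTGS (21-22 nt) and
    TGS (24 nt) size classes for one genome region."""
    counts = Counter(read_lengths)
    total = sum(counts.values())
    if total == 0:
        return {"ptgs": 0.0, "tgs": 0.0}
    ptgs = sum(counts[n] for n in PTGS_SIZES) / total
    tgs = sum(counts[n] for n in TGS_SIZES) / total
    return {"ptgs": ptgs, "tgs": tgs}

# Hypothetical read lengths (nt) for two regions; not data from this study.
virion_sense_orfs = [21, 21, 22, 21, 22, 24, 23, 21]
intron = [24, 24, 24, 22, 24, 21, 24, 23]

print(size_class_profile(virion_sense_orfs))
print(size_class_profile(intron))
```

In this toy input the virion sense region is dominated by 21-22 nt reads and the intron by 24 nt reads, mirroring the qualitative pattern described above.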

For both SWSV and SSEV we detected a significant association 
between the frequencies of siRNAs and the presence/absence of 
predicted secondary structures within both the single stranded 
DNA (ssDNA) genomes of these viruses and their predicted single 
stranded RNA (ssRNA) complementary and virion strand 
transcripts. However, whereas significantly more siRNAs 
corresponded with unstructured regions of the ssDNA genome, for the 
transcripts significantly more siRNAs corresponded with 
structured regions of ssRNA. It is plausible that base-paired nucleotides 
within transcript RNA molecules are protected from siRNA 
binding and that the secondary structures evident both in 
transcripts produced by SWSV and SSEV and in mastrevirus 
genomes in general [43] may represent an evolutionary adaptation 
for viral persistence. In mammalian RNA viruses there is an 
association between degrees of genomic secondary structure and 
infection duration with viruses having highly structured genomes 
tending to cause chronic infections and viruses with unstructured 
genomes tending to cause acute infections [44,45]. 

In both analysed Egyptian sugarcane accessions, VARX and 
USDA, SWSV was found to be present in co-infections with 
SSEV. Both sugarcane plants were independently collected in 
Egypt which suggests that SWSV infection of Egyptian sugarcane 
plants may not be a rare phenomenon. SWSV was also detected in 
SSEV-free plants that originated from Sudan. It is noteworthy that 
one of the Sudanese plants from which SWSV was isolated, 
E0144, was initially grown in Sudan in 1992 before being 
transferred to Barbados in 1998 and subsequently sent back to the 
CIRAD Sugarcane Quarantine Station in 2009 (unpublished data, 
CIRAD Sugarcane Quarantine Station). Assuming that SWSV 
did not infect this plant in Barbados between 1998 and 2009, it is 
plausible that SWSV was present along the Nile basin at least from 
the late 1980s. Interestingly, as a consequence of indel polymor- 
phisms in the 66 nt long SWSV intron, the Egyptian SWSV 
isolates VARX and USDA have a highly unusual genome 
organization and likely express a RepA protein that, while having 
the same N- and C-terminus sequences as Rep, is 22 amino acids 
longer than Rep. 

Figure 3. Maximum-likelihood phylogenetic tree of 63 virus isolates representing each known mastrevirus species (including major 
strains) and the 6 SWSV isolates determined in this study. Tree branches are coloured according to the geographical origins of the viruses. 
Branches marked with filled and open circles respectively have >95% and 80-94% approximate likelihood ratio test support; branches having <80% 
support were collapsed. The phylogenetic tree is rooted using the full genome sequences of dicot-infecting mastreviruses. 
doi:10.1371/journal.pone.0102945.g003 

The recent discoveries of SWSV and other highly divergent 
mastreviruses [46,47] suggest that this geminivirus genus likely 
encompasses a far greater diversity and has a greater global 
distribution than has been previously appreciated. The SWSV 
isolate from the Sudanese sugarcane plant that had been 
propagated in Barbados represents only the third instance of 
discovery of mastreviruses in the New World [16,48], and suggests 
that there may have been other undetected recent introductions of 
mastreviruses to the Americas. Although insect transmission of 
mastreviruses in the New World remains to be reported, it is 
noteworthy that one of the three mastrevirus species that have so far 
been detected in the Americas was isolated from a dragonfly which 
had possibly eaten a plant feeding insect that was carrying the 
virus [48]. The presence of SWSV in Barbados offers an 
opportunity to investigate possible natural transmission of the 
virus by screening sugarcane planted near the SWSV infected 
E0144 accessions. Phylogenetic analyses of any SWSV genomes 
sampled from such plants should reveal their likely recent 
transmission histories. 

Figure 4. Maximum-likelihood phylogenetic tree of Rep (A) and CP (B) proteins. Tree branches are coloured according to the geographical 
origins of the viruses. Branches marked with filled and open circles respectively have >95% and 80-94% approximate likelihood ratio test 
support; branches having <80% support were collapsed. 
doi:10.1371/journal.pone.0102945.g004 

Given the relatively high degrees of sequence divergence 
observed between the different SWSV isolates described here 
(~9%), it is plausible that the natural geographical range of SWSV 
is broader than just the Nile basin. Also, the global dissemination 
of sugarcane cuttings, the absence of SWSV diagnostic tools, and 
the fact that SWSV induces, at least in one case, extremely mild 
symptoms in sugarcane imply that SWSV may have already been 
unknowingly distributed throughout the sugarcane growing 
regions of the world. The failure of established sugarcane 
quarantine diagnostics in this regard provides a dramatic example 
of how potentially pathogenic viruses can evade the screening 
procedures of quarantine facilities and may spread worldwide 
through international plant material exchanges. In this regard the 
situation with SWSV might closely match that of Sugarcane yellow 
leaf virus (SCYLV), which remained unnoticed for at least 30 
years during its spread throughout the world [49]. In order to 
accurately determine the potential economic impacts of the 
dissemination of SWSV, additional studies assessing the pathoge- 
nicity of this virus are certainly warranted. 

Our study stresses both the potential advantages of NGS-based 
virus metagenomic screening in a plant quarantine setting, and the 
need to better assess viral diversity within plants that are destined 
for exotic habitats. It indicates that a combination of sequence- 
independent NGS-based partial viral genome sequencing coupled 
with sequence-dependent Sanger-based full genome cloning and 
sequencing is likely to reduce the number of non-intercepted virus 
pathogens passing through plant quarantine stations, while at the 
same time alerting authorities to the presence and potential spread 
of viruses with unknown pathogenic potentials. 

Supporting Information 

Figure S1 (A) Genome coverages obtained after a posteriori 
mapping against the complete genome of SSEV of reads produced 
by Illumina (siRNA sequencing). The genomic organization of 
SSEV is schematically shown at the top of the figure. (B) Size 
distribution of sequenced siRNAs obtained from the VARX plant 
mapping on the V1-V2 ORFs region of SSEV. Histograms 
represent the number of siRNA reads in each size class, and (C) 
Size distribution of reads mapping on the C1-C2 ORFs region of SSEV. 
(TIF) 

Figure S2 Two-dimensional genome-wide percentage 
pairwise nucleotide identity plot of monocot-infecting 
mastreviruses including the six novel SWSV isolates 
from this study. 

(TIF) 

Figure S3 (A) Maximum-likelihood phylogenetic tree of six 
SWSV isolates. The six isolates can be classified into 3 strains, 
SWSV-A (VARX, USDA), -B (B0069, D0005, E0144) and -C 
(A0037). (B) Genome-wide pairwise nucleotide similarity score 
matrix, the 94% strain demarcation threshold set by the 
Geminivirus study group of the ICTV (Muhire et al. 2013) is 
indicated (green coloured below 94% and pink-red coloured above 
94%). 
(TIF) 

Figure S4 Symptoms caused by SWSV on B0065, B0069 
and D0005 plants. 

(TIF) 